Importing Libraries¶
# Import the Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
Data Loading¶
# load data
df = pd.read_csv("athlete_events.csv")
Data Understanding and Cleaning¶
# Print the first 10 rows
df.head(10)
| ID | Name | Sex | Age | Height | Weight | Team | NOC | Games | Year | Season | City | Sport | Event | Medal | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | A Dijiang | M | 24.0 | 180.0 | 80.0 | China | CHN | 1992 Summer | 1992 | Summer | Barcelona | Basketball | Basketball Men's Basketball | NaN |
| 1 | 2 | A Lamusi | M | 23.0 | 170.0 | 60.0 | China | CHN | 2012 Summer | 2012 | Summer | London | Judo | Judo Men's Extra-Lightweight | NaN |
| 2 | 3 | Gunnar Nielsen Aaby | M | 24.0 | NaN | NaN | Denmark | DEN | 1920 Summer | 1920 | Summer | Antwerpen | Football | Football Men's Football | NaN |
| 3 | 4 | Edgar Lindenau Aabye | M | 34.0 | NaN | NaN | Denmark/Sweden | DEN | 1900 Summer | 1900 | Summer | Paris | Tug-Of-War | Tug-Of-War Men's Tug-Of-War | Gold |
| 4 | 5 | Christine Jacoba Aaftink | F | 21.0 | 185.0 | 82.0 | Netherlands | NED | 1988 Winter | 1988 | Winter | Calgary | Speed Skating | Speed Skating Women's 500 metres | NaN |
| 5 | 5 | Christine Jacoba Aaftink | F | 21.0 | 185.0 | 82.0 | Netherlands | NED | 1988 Winter | 1988 | Winter | Calgary | Speed Skating | Speed Skating Women's 1,000 metres | NaN |
| 6 | 5 | Christine Jacoba Aaftink | F | 25.0 | 185.0 | 82.0 | Netherlands | NED | 1992 Winter | 1992 | Winter | Albertville | Speed Skating | Speed Skating Women's 500 metres | NaN |
| 7 | 5 | Christine Jacoba Aaftink | F | 25.0 | 185.0 | 82.0 | Netherlands | NED | 1992 Winter | 1992 | Winter | Albertville | Speed Skating | Speed Skating Women's 1,000 metres | NaN |
| 8 | 5 | Christine Jacoba Aaftink | F | 27.0 | 185.0 | 82.0 | Netherlands | NED | 1994 Winter | 1994 | Winter | Lillehammer | Speed Skating | Speed Skating Women's 500 metres | NaN |
| 9 | 5 | Christine Jacoba Aaftink | F | 27.0 | 185.0 | 82.0 | Netherlands | NED | 1994 Winter | 1994 | Winter | Lillehammer | Speed Skating | Speed Skating Women's 1,000 metres | NaN |
# Print the column names and data types
df.dtypes
ID          int64
Name       object
Sex        object
Age       float64
Height    float64
Weight    float64
Team       object
NOC        object
Games      object
Year        int64
Season     object
City       object
Sport      object
Event      object
Medal      object
dtype: object
# Copy the DataFrame for analysis. Note: the copy is taken before duplicates are dropped below, so df_new keeps the duplicate rows
df_new = df.copy()
# Check and drop duplicates
print(f"Duplicates: {df.duplicated().sum()}")
df = df.drop_duplicates()
Duplicates: 1385
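Because `df_new` is copied before `drop_duplicates()` runs on `df`, the two frames can end up with different row counts. The following is a minimal sketch on hypothetical toy data (the `toy` and `toy_copy` names are made up for illustration) showing that a copy taken earlier is not affected by a later de-duplication of the original:

```python
import pandas as pd

# Hypothetical toy frame with one exact duplicate row (not real Olympic data)
toy = pd.DataFrame({"ID": [1, 1, 2], "Event": ["A", "A", "B"]})

toy_copy = toy.copy()          # copy taken first, as in the cell above
toy = toy.drop_duplicates()    # de-duplicates only the original

print(len(toy), len(toy_copy))

# De-duplicating the copy as well brings both frames back in line
toy_copy = toy_copy.drop_duplicates()
print(len(toy), len(toy_copy))
```

If the analysis is meant to run on de-duplicated data, one option is to drop duplicates before making the copy.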
# Checking for null values
print("\n Sum of null values: \n")
df.isnull().sum()
Sum of null values:
ID             0
Name           0
Sex            0
Age         9315
Height     58814
Weight     61527
Team           0
NOC            0
Games          0
Year           0
Season         0
City           0
Sport          0
Event          0
Medal     229959
dtype: int64
# Fill missing Age values with the median age grouped by Sex, Sport, and Year
df_new['Age'] = df_new.groupby(['Sex', 'Sport', 'Year'])['Age'].transform(lambda x: x.fillna(x.median()))
# Check if there are still missing Age values
df_new.isnull().sum()
ID             0
Name           0
Sex            0
Age            3
Height     60171
Weight     62875
Team           0
NOC            0
Games          0
Year           0
Season         0
City           0
Sport          0
Event          0
Medal     231333
dtype: int64
# fill in the median age values for the remaining ages
df_new['Age'] = df_new['Age'].fillna(df['Age'].median())
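The two-step imputation used here (group median first, overall median as a fallback) can be sketched on a small hypothetical frame. The data below is invented for illustration, and the grouping uses only `Sex` and `Sport` to keep the example short:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from athlete_events.csv
toy = pd.DataFrame({
    "Sex":   ["M", "M", "M", "F", "F", "F"],
    "Sport": ["Judo", "Judo", "Judo", "Judo", "Judo", "Golf"],
    "Age":   [20.0, np.nan, 30.0, 22.0, 26.0, np.nan],
})

# Step 1: fill each gap with the median of its own (Sex, Sport) group
group_median = toy.groupby(["Sex", "Sport"])["Age"].transform("median")
toy["Age_filled"] = toy["Age"].fillna(group_median)

# Step 2: a group with no observed ages has no median, so its rows are still NaN;
# fall back to the overall median of the observed ages
toy["Age_filled"] = toy["Age_filled"].fillna(toy["Age"].median())

print(toy)
```

The fallback step matters only for groups in which every value is missing, which is why a few gaps remain after the grouped fill.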
# Percentage of missing values in original data
missing_height_pct = df['Height'].isnull().sum() / len(df) * 100
missing_weight_pct = df['Weight'].isnull().sum() / len(df) * 100
print(f"Missing Height: {missing_height_pct:.2f}%")
print(f"Missing Weight: {missing_weight_pct:.2f}%")
Missing Height: 21.80%
Missing Weight: 22.81%
# Fill missing Height and Weight with median values grouped by Sex, Sport, and Year in df_new, then fall back to the overall median
df_new['Height'] = df_new.groupby(['Sex', 'Sport', 'Year'])['Height'].transform(lambda x: x.fillna(x.median()))
df_new['Weight'] = df_new.groupby(['Sex', 'Sport', 'Year'])['Weight'].transform(lambda x: x.fillna(x.median()))
df_new['Height'] = df_new['Height'].fillna(df['Height'].median())
df_new['Weight'] = df_new['Weight'].fillna(df['Weight'].median())
# Check if there are still missing Height and Weight values
df_new.isnull().sum()
ID             0
Name           0
Sex            0
Age            0
Height         0
Weight         0
Team           0
NOC            0
Games          0
Year           0
Season         0
City           0
Sport          0
Event          0
Medal     231333
dtype: int64
# If values for age, height or weight are zero or negative, replace them with the median
df_new.loc[df_new['Age'] <= 0, 'Age'] = df_new['Age'].median()
df_new.loc[df_new['Height'] <= 0, 'Height'] = df_new['Height'].median()
df_new.loc[df_new['Weight'] <= 0, 'Weight'] = df_new['Weight'].median()
print("Replaced invalid values with median.")
Replaced invalid values with median.
# Percentage of missing values for Medal
missing_medal_pct = df['Medal'].isnull().sum() / len(df) * 100
print(f"Missing Medal Value in Percentage: {missing_medal_pct:.2f}%")
Missing Medal Value in Percentage: 85.25%
# fill the missing values with No Medal
df_new['Medal'] = df_new['Medal'].fillna("No Medal")
# make sure there are no null values
df_new.isnull().sum()
ID        0
Name      0
Sex       0
Age       0
Height    0
Weight    0
Team      0
NOC       0
Games     0
Year      0
Season    0
City      0
Sport     0
Event     0
Medal     0
dtype: int64
Additional Cleaning¶
# rename the Team column to Country
df_new.rename(columns={'Team': 'Country'}, inplace=True)
# Uniform capitalization with title case
df_new['Country'] = df_new['Country'].str.title()
# Check the columns after the additional changes (Team renamed to Country)
df_new.head(10)
| ID | Name | Sex | Age | Height | Weight | Country | NOC | Games | Year | Season | City | Sport | Event | Medal | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | A Dijiang | M | 24.0 | 180.0 | 80.0 | China | CHN | 1992 Summer | 1992 | Summer | Barcelona | Basketball | Basketball Men's Basketball | No Medal |
| 1 | 2 | A Lamusi | M | 23.0 | 170.0 | 60.0 | China | CHN | 2012 Summer | 2012 | Summer | London | Judo | Judo Men's Extra-Lightweight | No Medal |
| 2 | 3 | Gunnar Nielsen Aaby | M | 24.0 | 171.5 | 74.0 | Denmark | DEN | 1920 Summer | 1920 | Summer | Antwerpen | Football | Football Men's Football | No Medal |
| 3 | 4 | Edgar Lindenau Aabye | M | 34.0 | 175.0 | 70.0 | Denmark/Sweden | DEN | 1900 Summer | 1900 | Summer | Paris | Tug-Of-War | Tug-Of-War Men's Tug-Of-War | Gold |
| 4 | 5 | Christine Jacoba Aaftink | F | 21.0 | 185.0 | 82.0 | Netherlands | NED | 1988 Winter | 1988 | Winter | Calgary | Speed Skating | Speed Skating Women's 500 metres | No Medal |
| 5 | 5 | Christine Jacoba Aaftink | F | 21.0 | 185.0 | 82.0 | Netherlands | NED | 1988 Winter | 1988 | Winter | Calgary | Speed Skating | Speed Skating Women's 1,000 metres | No Medal |
| 6 | 5 | Christine Jacoba Aaftink | F | 25.0 | 185.0 | 82.0 | Netherlands | NED | 1992 Winter | 1992 | Winter | Albertville | Speed Skating | Speed Skating Women's 500 metres | No Medal |
| 7 | 5 | Christine Jacoba Aaftink | F | 25.0 | 185.0 | 82.0 | Netherlands | NED | 1992 Winter | 1992 | Winter | Albertville | Speed Skating | Speed Skating Women's 1,000 metres | No Medal |
| 8 | 5 | Christine Jacoba Aaftink | F | 27.0 | 185.0 | 82.0 | Netherlands | NED | 1994 Winter | 1994 | Winter | Lillehammer | Speed Skating | Speed Skating Women's 500 metres | No Medal |
| 9 | 5 | Christine Jacoba Aaftink | F | 27.0 | 185.0 | 82.0 | Netherlands | NED | 1994 Winter | 1994 | Winter | Lillehammer | Speed Skating | Speed Skating Women's 1,000 metres | No Medal |
Understanding the ranges for Age, Height and Weight¶
# Calculate summary statistics for 'Age', 'Height', and 'Weight'
stats = df_new[['Age', 'Height', 'Weight']].agg(['mean', 'median', 'std', 'min', 'max'])
# Print the statistics
print(stats)
              Age      Height      Weight
mean    25.622475  175.127274   70.999576
median  25.000000  175.000000   70.000000
std      6.401233    9.736305   13.390853
min     10.000000  127.000000   25.000000
max     97.000000  226.000000  214.000000
Exploring: This shows a huge range in the age, height and weight of the participants. However, the mean age is around 25, an age at which most athletes are at their peak in terms of fitness.
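The spread can be made explicit by adding a range row (max minus min) to the summary table. This is a minimal sketch on hypothetical values; `"range"` is not a built-in aggregation name in pandas, so the row is derived from the `min` and `max` rows:

```python
import pandas as pd

# Hypothetical toy values, for illustration only
toy = pd.DataFrame({
    "Age": [18.0, 25.0, 40.0],
    "Height": [160.0, 175.0, 190.0],
    "Weight": [55.0, 70.0, 100.0],
})

stats = toy.agg(["mean", "median", "std", "min", "max"])

# Derive the range from the existing min and max rows
stats.loc["range"] = stats.loc["max"] - stats.loc["min"]

print(stats)
```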
df_cleaned = df_new.copy()
df_cleaned.to_csv('olympic_data_analysis_cleaned.csv', index=False)
Exploratory Data Analysis (EDA)¶
1. Total Number of Athletes and Participation Over Years¶
# Get the number of unique athlete names, since the same athlete can appear in several events and Games
total_athletes = df_new['Name'].nunique()
print(f"Total number of unique athletes: {total_athletes}")
Total number of unique athletes: 134732
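Counting unique names treats two different athletes who share a name as one person, whereas the `ID` column identifies each athlete separately. A small sketch on hypothetical rows (the name and events below are invented) shows how the two counts can differ:

```python
import pandas as pd

# Hypothetical rows: athlete 1 enters two events, athlete 2 has the same name
toy = pd.DataFrame({
    "ID":    [1, 1, 2],
    "Name":  ["Kim Lee", "Kim Lee", "Kim Lee"],
    "Event": ["100 m", "200 m", "Judo"],
})

unique_names = toy["Name"].nunique()  # shared names collapse into one
unique_ids = toy["ID"].nunique()      # one count per athlete ID

print(unique_names, unique_ids)
```

Whether this changes the total for the real dataset depends on how many names are shared, which this sketch does not measure.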
# Get the total number of participants per year per season
participation_by_year = df_new.groupby(['Year', 'Season'])['Name'].nunique().reset_index()
participation_by_year.columns = ['Year', 'Season', 'Number of Participants']
print(" Number of Participation over Years\n")
print(participation_by_year.to_string(index=False))
 Number of Participation over Years

 Year Season  Number of Participants
 1896 Summer                     176
 1900 Summer                    1220
 1904 Summer                     650
 1906 Summer                     841
 1908 Summer                    2024
 1912 Summer                    2409
 1920 Summer                    2675
 1924 Summer                    3256
 1924 Winter                     313
 1928 Summer                    3246
 1928 Winter                     461
 1932 Summer                    1922
 1932 Winter                     252
 1936 Summer                    4482
 1936 Winter                     668
 1948 Summer                    4402
 1948 Winter                     668
 1952 Summer                    4931
 1952 Winter                     694
 1956 Summer                    3346
 1956 Winter                     821
 1960 Summer                    5348
 1960 Winter                     665
 1964 Summer                    5134
 1964 Winter                    1094
 1968 Summer                    5552
 1968 Winter                    1160
 1972 Summer                    7105
 1972 Winter                    1008
 1976 Summer                    6070
 1976 Winter                    1127
 1980 Summer                    5252
 1980 Winter                    1071
 1984 Summer                    6791
 1984 Winter                    1272
 1988 Summer                    8443
 1988 Winter                    1425
 1992 Summer                    9380
 1992 Winter                    1801
 1994 Winter                    1738
 1996 Summer                   10324
 1998 Winter                    2178
 2000 Summer                   10639
 2002 Winter                    2397
 2004 Summer                   10537
 2006 Winter                    2494
 2008 Summer                   10880
 2010 Winter                    2535
 2012 Summer                   10502
 2014 Winter                    2744
 2016 Summer                   11174
Remarks: Participation increases overall over the years. The numbers alternate between high and low values because the Summer Olympics draw far more participants than the Winter Olympics.
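One way to read the two seasons separately is to pivot the long table so that each season gets its own column. This is a sketch on hypothetical counts; the column names follow `participation_by_year`, but the numbers are invented:

```python
import pandas as pd

# Hypothetical counts, for illustration only
participation = pd.DataFrame({
    "Year": [1992, 1992, 1994, 1996],
    "Season": ["Summer", "Winter", "Winter", "Summer"],
    "Number of Participants": [100, 20, 25, 110],
})

# One row per year, one column per season; years without a given season become NaN
by_season = participation.pivot(
    index="Year", columns="Season", values="Number of Participants"
)

print(by_season)
```

With the seasons in separate columns, each trend can be inspected or plotted without the alternating pattern.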
2. Top Participating Countries¶
# Group by NOC and count entries; an athlete is counted once per event entered, so this is not a count of unique athletes
country_participation = df_new.groupby('NOC')['ID'].count().sort_values(ascending=False).reset_index().rename(columns={'ID': 'Number of Participants'})
print("Top Participating Countries \n")
country_participation.head(15)
Top Participating Countries
| NOC | Number of Participants | |
|---|---|---|
| 0 | USA | 18853 |
| 1 | FRA | 12758 |
| 2 | GBR | 12256 |
| 3 | ITA | 10715 |
| 4 | GER | 9830 |
| 5 | CAN | 9733 |
| 6 | JPN | 8444 |
| 7 | SWE | 8339 |
| 8 | AUS | 7638 |
| 9 | HUN | 6607 |
| 10 | POL | 6207 |
| 11 | SUI | 6150 |
| 12 | NED | 5839 |
| 13 | URS | 5685 |
| 14 | FIN | 5467 |
3. Top 10 Sports with Most Events¶
# Group by sport and count entries (one row per athlete per event)
top_sports = df_new.groupby('Sport')['ID'].count().reset_index()
top_sports.columns = ['Sport', 'Event Count']
top_sports = top_sports.sort_values(by='Event Count', ascending=False).head(10)
# add ranks for clarity
top_sports['Rank'] = range(1, 11)
# Reorder the columns so that Rank comes first
top_sports = top_sports[['Rank', 'Sport', 'Event Count']]
print(top_sports.to_string(index=False))
Rank Sport Event Count
1 Athletics 38624
2 Gymnastics 26707
3 Swimming 23195
4 Shooting 11448
5 Cycling 10859
6 Fencing 10735
7 Rowing 10595
8 Cross Country Skiing 9133
9 Alpine Skiing 8829
10 Wrestling 7154
4. Athlete Representation in Summer vs. Winter¶
# groupby season and name of the athletes (only unique values to exclude repetition)
season_representation = df_new.groupby('Season')['Name'].nunique().reset_index()
season_representation.columns = ['Season', 'Athlete Count']
print(season_representation.to_string(index=False))
Season  Athlete Count
Summer         116122
Winter          18923
Observation: Far more athletes have taken part in the Summer Olympics than in the Winter Olympics, consistent with the participation figures above.
Data Visualization and Interpretation¶
1. Medal Trends Over Time by Gender¶
# Separate the dataset into Summer and Winter Olympics data
summer_data = df_new[df_new['Season'] == 'Summer']
winter_data = df_new[df_new['Season'] == 'Winter']
# Exclude "No Medal" rows and group by year, season and sex
medal_trends = df_new[df_new['Medal'] != 'No Medal'].groupby(['Year', 'Season', 'Sex'])['ID'].count().reset_index()
medal_trends.columns = ['Year', 'Season', 'Sex', 'Medal Count']
plt.figure(figsize=(12, 6))
# Loop through each unique season
for season in medal_trends['Season'].unique():
    subset = medal_trends[medal_trends['Season'] == season]
    # Further loop through each gender for the current season
    for gender in subset['Sex'].unique():
        gender_subset = subset[subset['Sex'] == gender]
        plt.plot(gender_subset['Year'], gender_subset['Medal Count'], label=f"{season} - {gender}")
plt.title("Medal Trends Over Time by Gender and Season")
plt.xlabel("Year")
plt.ylabel("Number of Medals")
plt.legend(title="Season & Gender")
plt.grid()
plt.show()
Interpretation and Observation:
To compare the medals won by men and women over the years, the medal rows are grouped by Year, Season and Sex. The gap between male and female medal counts narrows over time in both seasons, which points to growing female participation. From around 1980 there is a rapid rise in medals won by female athletes in the Summer Olympics, and a similar trend appears in the Winter Olympics after 1984.
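The narrowing gap can also be expressed as the female share of medal rows per year. The sketch below uses hypothetical rows with the same `Year`, `Sex` and `Medal` columns; the values are invented and do not come from the dataset:

```python
import pandas as pd

# Hypothetical rows, for illustration only
toy = pd.DataFrame({
    "Year":  [1900, 1900, 1900, 2000, 2000, 2000, 2000],
    "Sex":   ["M", "M", "F", "M", "F", "F", "M"],
    "Medal": ["Gold", "Silver", "No Medal", "Gold", "Silver", "Bronze", "No Medal"],
})

# Keep medal rows only, then count them per year and sex
medalists = toy[toy["Medal"] != "No Medal"]
counts = medalists.groupby(["Year", "Sex"]).size().unstack(fill_value=0)

# Share of medal rows that went to women in each year
female_share = counts["F"] / counts.sum(axis=1)

print(female_share)
```

A share is easier to compare across years than raw counts, because the total number of medals also grows over time.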
2. Heatmap of Medals by Sports and Years¶
# Group by Decade and Sport
heatmap_data_decade = (
    df_new[df_new['Medal'] != 'No Medal']
    .groupby([(df_new['Year'] // 10) * 10, 'Sport'])['ID']
    .count()
    .unstack(fill_value=0)
)
# Create the heatmap
plt.figure(figsize=(23, 16))
sns.heatmap(heatmap_data_decade, cmap='YlGnBu', linewidths=0.5, linecolor='gray')
plt.title("Heatmap of Medals by Sports and Decades", fontsize=18)
plt.xlabel("Sport", fontsize=18)
plt.ylabel("Decade", fontsize=16)
plt.xticks(rotation=45, ha='right')
plt.show()
Interpretation and Observation:
A few things can be read from the heatmap:
- Certain sports like Athletics, Rowing, Football, Ice Hockey, and Swimming have shown increasing popularity, consistently awarding more medals in recent decades compared to the past.
- Sports such as Gymnastics, Fencing, and Shooting have experienced fluctuating popularity, with noticeable peaks and declines over different decades. This indicates shifts in athlete participation, audience interest, or perhaps changes in event availability over time.
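Raw medal counts rise for most sports simply because the Games have grown, so it can help to look at each sport's share of the medals within a decade. This is a sketch on a hypothetical decade-by-sport table shaped like `heatmap_data_decade`; the counts are invented:

```python
import pandas as pd

# Hypothetical decade x sport medal counts, for illustration only
counts = pd.DataFrame(
    {"Athletics": [10, 30], "Fencing": [10, 10]},
    index=pd.Index([1900, 2000], name="Decade"),
)

# Divide each row by its total so every decade sums to 1
shares = counts.div(counts.sum(axis=1), axis=0)

print(shares)
```

In this toy table Fencing keeps the same raw count in both decades, yet its share halves, which is the kind of shift a raw-count heatmap can hide.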
3. Athlete Count vs. Medals Won¶
# Count unique athletes and medals by country
country_stats = df_new.groupby('NOC').agg({
    'Name': 'nunique',  # Unique count of athletes
    'Medal': lambda x: (x != 'No Medal').sum()  # Count of medals (excluding 'No Medal')
}).reset_index()
# Rename columns for clarity
country_stats.columns = ['Country', 'Athlete Count', 'Medal Count']
plt.figure(figsize=(10, 6))
plt.scatter(country_stats['Athlete Count'], country_stats['Medal Count'], alpha=0.7)
plt.title("Athlete Count vs. Medals Won by Country")
plt.xlabel("Number of Athletes")
plt.ylabel("Number of Medals")
plt.grid()
plt.show()
Interpretation and Observation:
- The scatter plot reveals a positive correlation between the number of athletes sent by a country and the number of medals won. This relationship is intuitive, as countries that send larger delegations generally have a higher chance of securing more medals due to increased representation across different events.
- A few outliers sent a relatively small number of athletes yet won a significant number of medals, but such countries are rare.
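The relationship seen in the scatter plot can be summarised with a correlation coefficient, and the outliers can be surfaced with a medals-per-athlete ratio. This sketch uses a hypothetical table with the same columns as `country_stats`; the country codes and numbers are invented:

```python
import pandas as pd

# Hypothetical per-country totals, for illustration only
toy_stats = pd.DataFrame({
    "Country": ["AAA", "BBB", "CCC", "DDD"],
    "Athlete Count": [100, 200, 300, 50],
    "Medal Count": [10, 20, 30, 12],
})

# Pearson correlation between delegation size and medals won
corr = toy_stats["Athlete Count"].corr(toy_stats["Medal Count"])

# Medals per athlete highlights small delegations that win many medals
toy_stats["Medals per Athlete"] = toy_stats["Medal Count"] / toy_stats["Athlete Count"]
ranked = toy_stats.sort_values("Medals per Athlete", ascending=False)

print(f"Correlation: {corr:.2f}")
print(ranked)
```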
4. Season-Specific Medal Trends¶
# Filter data for rows with medals
medal_data = df_new[df_new['Medal'] != 'No Medal']
# Group by Season and count medals
season_medals = medal_data.groupby('Season')['ID'].count().reset_index()
season_medals.columns = ['Season', 'Medal Count']
print(season_medals)
   Season  Medal Count
0  Summer        34088
1  Winter         5695
plt.figure(figsize=(8, 6))
plt.bar(season_medals['Season'], season_medals['Medal Count'], color=['#FFA07A', '#87CEEB'])
plt.title("Total Medals Won: Summer vs Winter Olympics")
plt.xlabel("Season")
plt.ylabel("Number of Medals")
plt.xticks(rotation=0)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
Interpretation/Observation:
- The bar chart shows a significantly higher number of medals awarded in the Summer Olympics compared to the Winter Olympics. This is largely due to the fact that the Summer Olympics feature a greater number of sports and events, allowing for more athletes to participate and more medals to be awarded.
- This difference in medal counts is a reflection of larger athlete participation in the Summer Olympics, as there are simply more events and opportunities to compete.
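The counts above are per athlete row, so a team event contributes one medal row for every team member. A rough way to count medals per event instead is to keep one row per Games, NOC, event and medal type. The sketch below uses hypothetical rows (the NOC codes and events are invented), and treating that combination as one medal is an assumption that would undercount an event where one NOC wins two medals of the same type:

```python
import pandas as pd

# Hypothetical rows: a three-person team gold and one individual gold
toy = pd.DataFrame({
    "Games": ["1992 Summer"] * 4,
    "NOC":   ["AAA", "AAA", "AAA", "BBB"],
    "Event": ["Team Event"] * 3 + ["Individual Event"],
    "Medal": ["Gold", "Gold", "Gold", "Gold"],
})

medal_rows = toy[toy["Medal"] != "No Medal"]

# Assumption: one medal per (Games, NOC, Event, Medal) combination
medal_events = medal_rows.drop_duplicates(subset=["Games", "NOC", "Event", "Medal"])

print(len(medal_rows), len(medal_events))
```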
5. BMI Distribution by Sport¶
# Box plot to see how BMI varies over different sports
plt.figure(figsize=(16, 8))
sns.boxplot(
    x='Sport',
    y=df_new['Weight'] / (df_new['Height'] / 100) ** 2,  # Calculate BMI directly
    data=df_new,
    showfliers=False
)
plt.title("BMI Distribution by Sport")
plt.xlabel("Sport")
plt.ylabel("BMI")
plt.xticks(rotation=90)
plt.show()
Interpretation/Observation:
- From the data, it’s evident that different sports exhibit distinct BMI characteristics, which is likely due to the varying physical requirements for each sport.
- Sports such as Weightlifting, Wrestling, and Judo tend to have a wider range of BMI values. This is because these sports have different weight categories, which means athletes can range significantly in size and muscle mass.
- Sports like Tug-Of-War and Rugby Sevens generally have higher average BMI values. This is because these sports require a lot of strength, muscle mass, and power, which naturally correlates with a higher BMI.
- In contrast, sports like Rhythmic Gymnastics and Synchronized Swimming tend to have lower average BMI values. These sports demand agility, flexibility, and endurance, which often means athletes maintain a leaner physique.
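BMI is weight in kilograms divided by the square of height in metres, which is why the plot divides `Height` by 100 first. As a worked sketch, the three height and weight pairs below are taken from the first rows shown earlier; storing the result in a `BMI` column and taking the median per sport are illustrative choices rather than steps from the notebook:

```python
import pandas as pd

# Height (cm) and Weight (kg) values from the first rows of the dataset preview
toy = pd.DataFrame({
    "Sport":  ["Basketball", "Judo", "Speed Skating"],
    "Height": [180.0, 170.0, 185.0],
    "Weight": [80.0, 60.0, 82.0],
})

# BMI = weight (kg) / height (m)^2
toy["BMI"] = toy["Weight"] / (toy["Height"] / 100) ** 2

# Median BMI per sport, sorted from highest to lowest
median_bmi = toy.groupby("Sport")["BMI"].median().sort_values(ascending=False)

print(median_bmi.round(2))
```

Note that in `df_new` many heights and weights are imputed medians, so BMI values derived from them are only approximate.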
Based on Demographics¶
Understanding participations based on age¶
# Histogram to understand distribution of athletes of different age
fig = px.histogram(df_new, x ='Age', nbins=60, title='age distribution')
fig.show()
Interpretation/Observation
Analyzing women's participation over the years¶
# Group the data by 'Year' and 'Sex' and count the number of participants
grouped_data = df_new.groupby(['Year', 'Sex']).size().reset_index(name='Count')
# Bar graph to understand distribution of both sexes over the years
fig = px.bar(grouped_data, x='Year', y='Count',
             color='Sex',
             barmode='group',
             title='Number of Men and Women Participating Each Year',
             labels={'Count': 'Number of Participants', 'Year': 'Year'})
fig.update_layout(
    xaxis_title='Year',
    yaxis_title='Number of Participants',
    title_x=0.5,
    xaxis_tickangle=45  # Rotate x-axis labels for better readability
)
fig.show()
Understanding Height and Weight based on Sex¶
# Box plot to show Height based on Sex
fig = px.box(df_new, x='Sex', y='Height', color ='Sex', title='Sex Vs Height')
fig.show()
# Box plot to show Weight based on Sex
fig = px.box(df_new, x='Sex', y='Weight', color ='Sex', title='Sex Vs Weight')
fig.show()
Based on Teams and Medal wins¶
# Clean up data to visualize the medal wins
# Keep only the rows where a medal was won
df_filtered = df_new[df_new['Medal'] != 'No Medal']
# Group by country (team) and medal type
df_medal = df_filtered.groupby(['Country', 'Medal']).size().reset_index(name='Count')
# Pivot to create separate columns for each medal type and fill none values with zeros
df_medals_pivot = df_medal.pivot(index='Country', columns='Medal', values='Count').fillna(0)
# Add a Total column
df_medals_pivot['Total'] = df_medals_pivot.sum(axis=1)
# Reset the index to turn it back into a DataFrame
df_medals_pivot = df_medals_pivot.reset_index()
# Sort by each medal type and extract top 10 teams
top_gold = df_medals_pivot.sort_values(by='Gold', ascending=False).head(10)
top_silver = df_medals_pivot.sort_values(by='Silver', ascending=False).head(10)
top_bronze = df_medals_pivot.sort_values(by='Bronze', ascending=False).head(10)
# Plot a bar chart of the top winning teams
# Combine the top 10 teams from each medal category, keeping each team only once
top_combined = pd.concat([top_gold, top_silver, top_bronze]).drop_duplicates(subset='Country')
# Create a grouped bar chart
fig = px.bar(
    top_combined,
    x='Country',
    y=['Gold', 'Silver', 'Bronze'],
    title='Top 10 Teams by Medal Categories',
    labels={'value': 'Medal Count', 'variable': 'Medal Type'},
    barmode='group'
)
fig.show()
Expected Output¶
Height Distribution of Medalists by Gender¶
# Heights of male and female medalists
medalists = df_new[df_new['Medal'] != 'No Medal']
male_heights = medalists[medalists['Sex'] == 'M']['Height']
female_heights = medalists[medalists['Sex'] == 'F']['Height']
# Plot histogram
plt.figure(figsize=(8, 5))
plt.hist(
    [female_heights, male_heights],
    bins=20,
    edgecolor='black',
    alpha=1.0,  # No transparency to avoid overlapping colors
    color=['plum', 'lightskyblue'],
    label=['Female', 'Male'],
    stacked=True
)
# Add labels and titles
plt.title("Height Distribution of Medalists by Gender")
plt.xlabel("Height (cm)")
plt.ylabel("Frequency")
plt.legend(title="Sex")
plt.grid(True)
plt.tight_layout()
plt.show()
Interpretation/Observation:
- There are significant differences in the height distributions of male and female medalists.
- Male athletes are, on average, taller than female athletes, with their distribution being centered around a higher mean. The majority of female athletes have a height between 155 cm and 175 cm, with a peak near 165 cm. This reflects the natural height differences between genders and highlights that the majority of female athletes are shorter compared to their male counterparts.
- The differences in height distributions may be influenced by the types of sports that athletes compete in.
# Group data by Year and Sex to count participants
gender_representation = df_new.groupby(['Year', 'Sex'])['ID'].count().unstack()
# Plot bar chart
gender_representation.plot(
    kind='bar',
    stacked=True,
    figsize=(12, 6),
    color=['lightblue', 'chocolate'],
    edgecolor='black'
)
# Add title and labels
plt.title("Gender Representation Over Time")
plt.xlabel("Year")
plt.ylabel("Number of Athletes")
plt.legend(title="Sex", labels=["Female", "Male"])
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
# Show the plot
plt.show()
Interpretation/Observation:
- The number of female athletes rises over the years.
- The overall number of athlete participations clearly increases, while the gap between the numbers of male and female participants appears to be narrowing.
- There are significantly more participants in the Summer Olympics than in the Winter Olympics, possibly because the Summer Olympics include more sports.
Insights and Generalizations¶
Additional Analysis/Findings (Exploring more)¶
Understand distribution of athletes of different age¶
# Histogram to understand distribution of athletes of different age
fig = px.histogram(df_new, x ='Age', nbins=60, title='Age Distribution')
fig.show()
Observation
- The histogram indicates that the age of participants ranges widely, from 10 years old to 97 years old according to the summary statistics above. This suggests a diverse set of participants in terms of age, depending on the type of sport and its physical demands.
- Most athletes cluster around their mid-to-late 20s, with a median age of 25. This age range likely represents the peak performance years for many athletes, where they have the optimal combination of experience, physical strength, and fitness.
Understanding the participation of each sex over the years¶
# Group the data by 'Year' and 'Sex' and count the number of participants
grouped_data = df_new.groupby(['Year', 'Sex']).size().reset_index(name='Count')
# Bar graph to understand distribution of both sexes over the years
fig = px.bar(grouped_data, x='Year', y='Count',
             color='Sex',
             barmode='group',
             title='Number of Men and Women Participating Each Year',
             labels={'Count': 'Number of Participants', 'Year': 'Year'})
fig.update_layout(
    xaxis_title='Year',
    yaxis_title='Number of Participants',
    title_x=0.5,
    xaxis_tickangle=45  # Rotate x-axis labels for better readability
)
fig.show()
Observation
- A significant number of women have participated over the years.
- There are no female participants before 1900, indicating that women were largely excluded from the early Olympic Games. Female participation began to appear in the early 20th century, as more sports began allowing female competitors, marking a shift towards gender inclusivity in the Olympics.
Note: There are no participants for the years 1916, 1940 and 1944, because the Olympics were cancelled during World War I and World War II.
Analyze relation between Height, Weight and Sex¶
# Box plot to show Height based on Sex
fig = px.box(df_new, x='Sex', y='Height', color ='Sex', title='Sex Vs Height')
fig.show()
Observation
- The median height of male athletes is greater than that of female athletes, and many male athletes are taller than most female athletes.
# Box plot to show Weight based on Sex
fig = px.box(df_new, x='Sex', y='Weight', color ='Sex', title='Sex Vs Weight')
fig.show()
Observation
- The median weight of male athletes is noticeably higher than that of female athletes. This is expected given the natural differences in body composition and the physical demands of many sports where male athletes compete.
- These observations highlight the differences in physical attributes between male and female athletes, which are influenced by both biological factors and the types of sports they participate in.
Summary¶
Demographics and Participation Trends
- Increasing Participation Over Time: Both male and female participation has grown significantly since the inception of the modern Olympics. The growth in female athletes is particularly notable, with increased representation especially after the 1960s, reflecting changing societal norms and a shift towards gender inclusivity.
- Seasonal Trends: The Summer Olympics have consistently attracted a larger number of participants and awarded more medals compared to the Winter Olympics, mainly due to a broader range of sports and events.
Medal Distribution Insights
- Age and Medal Success: The analysis suggests that the majority of medals are won by athletes in their mid-20s to early 30s.
- Countries with Higher Participation: Countries that send larger delegations tend to win more medals, indicating a positive correlation between the number of participating athletes and medal success. However, outliers exist: certain countries are able to win a substantial number of medals despite sending fewer athletes, likely due to targeted training programs or specialization in particular sports.
Physical Attributes
- Height Differences: Male athletes are generally taller, reflecting biological factors and sport-specific physical demands.
- Weight Differences: Male athletes also have higher median weights, especially in sports like weightlifting and wrestling, which require strength and power.
- BMI Trends by Sport: High BMI is common in power-based sports (e.g., weightlifting, rugby), while lower BMI is typical in agility-focused sports (e.g., gymnastics, long-distance running).
Sports-Specific Trends
- Growth in Popularity: Sports like Athletics, Swimming, and Football have seen increased participation and a growing medal count over time.
In conclusion, the Olympic Games have evolved significantly since their inception, moving from a predominantly male-dominated arena to one that encourages gender inclusivity and diversity. The growth in both number of sports and athlete participation highlights the expanding appeal and scope of the games.